Performance: when things go wrong!

Prague PostgreSQL Meetup
2019-08-26

Who am I

  • Lætitia Avrot
  • PostgreSQL CoC Member
  • #PostgresWomen co-founder
  • EDB Senior Database Consultant
  • @l_avrot
Image by Anemone123 from Pixabay

Performance tuning investigation case

  • Context
  • What happened?
  • How we solved it
  • How we could have avoided it

Some context

  • Car sharing company in Paris
  • Black Friday Week-end
  • Saturday
  • 6 PM

What sets off alarm bells

  • No monitoring alert
  • Everything works!
  • Switchboard is overflooded

I'm blind

  • VPN is down
  • VPN fixed at 7:30PM
  • No "appropriate" logging
Image by Anemone123 from Pixabay/Users/laetitiaavrot/Downloads/blindfolded-1732522_1920.jpg

What you need to log

  • Checkpoints
  • Connections and disconnections
  • Lock waits
  • Temp files
  • Slow autovacuum
  • Slow queries
  • Default verbosity
Image by Dean Moriarty from Pixabay

What you need to log

log_checkpoints = on
log_connections = on
log_disconnections = on
log_lock_waits = on
log_temp_files = 0
log_autovacuum_min_duration = 0
log_min_duration_statement = 0
log_error_verbosity = default
Image by Dean Moriarty from Pixabay

I can see!

  • Simple query
  • Over 250ms (but shouldn't)
  • Sent every 200 ms
  • No software change

Explanation

  • Planner is wrong
  • Gather statistics again → Good plan
  • Gather statistics again → Bad plan
  • Gather statistics again → Good plan
  • ➜ Planner is wronged by wrong statistics

Problem solved!

  • Plan is good (for now)
  • 10 PM
  • People don't need a car anymore
  • Leave it at it
Image by Alexas_Fotos from Pixabay

What happened ?

  • Bad code
  • Bad data management
  • Bad data model
Image by Free-Photos from Pixabay

Bad code

  • Station asks for car rental status…
  • ... too often!!

Bad data management

  • No data deletion…
  • … Ever!
  • 5 years of data or so
  • 200 million rows

Bad data model: Data type


  • Using NULL wrongly
  • Not using intervals
Image by Michael Gaida from Pixabay

Bad data model: Structure


  • History and current rentals
  • No partitioning
Image by Michael Gaida from Pixabay

The query

select count(*)
from rentals
where end_at is null
Image by Gerd Altmann from Pixabay

Bad data model: Indexes

What happened ?

It's all about…
STATISTICS
Image by Dean Moriarty from Pixabay

Statistics

  • pg_stats view
  • `most_common_vals`, `most_common_freqs`
  • `null_frac`
  • Fraction of column entries that are null
Image by Dean Moriarty from Pixabay

What can we do

  • Stop spammimg the query
  • Change the data model
  • Drop useless indexes
  • Partition data
  • No autovacuum tuning on that table
  • Purge some useless data
Image by 453169 from Pixabay

I want it all!!

...But...

  • Stop spammimg the query
  • Change the data model
  • Drop useless indexes
  • Partition data
  • Tune autovacuum
  • Purge some useless data

Let's "force" postgres into choosing the "good" plan
Image by Yvette Fang from Pixabay

Studying how the plan changes: tools

  • auto_explain
  • \watch [ seconds ]
  • Spreadsheets and Graphs

Image by 4956365 from Pixabay

How to "force" Postgres planner

  • You can't, but you can
  • controlling statistics gathering frequency
  • controlling statistics depth

Results

  • 200 millions rows, default_statistics_target=100
  • ➜ 20% of "bad" plans
  • 50 millions rows, default_statistics_target=1000
  • ➜ 0.1% of "bad" plans
Image by rawpixel from Pixabay

Better safe than sorry?

  • 0.1% is still "a lot"
  • watchguard to monitor that query performances
  • automatically analyze if too slow

Source of the problem

  • Junior devs
  • No ORM mastering
  • Poor database skills
  • Bad design
  • No old rows deletion policy
  • Too many indexes
  • No autovacuum tuning

Advices for "on call" DBAs

  • Be confident
  • Call a friend
  • Enjoy saving the world!
Image by Elias Sch. from Pixabay

Feel free to ask any question

Image parDean Moriarty de Pixabay